During the summer of 2012, wildfires ravaged the Algerian territory, covering most of the northern part of the country, especially the coastal cities. This disaster was driven by higher-than-average temperatures, which reached as high as 50 degrees Celsius.
One important measure against the recurrence of such disasters is the ability to predict their occurrence. In this project, we will therefore attempt to predict these forest fires based on multiple features related to weather indices.
The Dataset we will use to train and test our models consists of 244 observations from two Algerian Wilayas (provinces): Sidi-Bel Abbes and Bejaia. The observations were gathered over a period of 4 months, from June to September 2012, for both regions.
The Dataset contains the following variables (the fire-danger indices follow the Canadian Fire Weather Index system):
- day, month, year: the date of the observation
- Temperature: maximum daily temperature in degrees Celsius
- RH: relative humidity (%)
- Ws: wind speed
- Rain: daily rainfall
- FFMC: Fine Fuel Moisture Code
- DMC: Duff Moisture Code
- DC: Drought Code
- ISI: Initial Spread Index
- BUI: Buildup Index
- FWI: Fire Weather Index
- Classes: whether a fire occurred ("fire" / "not fire")
We first start off by importing the necessary libraries for our analysis.
We rely on dplyr and tidyr for data manipulation (full_join(), drop_na()), plyr for recoding values (mapvalues()), ggplot2, ggcorrplot and plotly for visualization, vtable for summary tables (st()), caret for feature selection and model training, caTools for the train/test split, MASS for discriminant analysis, and ROCR for ROC curves.
The Dataset provided to us came as a single .csv file containing two tables: one for the observations belonging to the Sidi-Bel Abbes region, and the other for Bejaia.
Before starting our analysis, we separated the tables into two distinct files according to the region, named Algerian_forest_fires_dataset_Bejaia.csv and Algerian_forest_fires_dataset_Sidi_Bel_Abbes.csv for Bejaia and Sidi-Bel Abbes respectively.
We first check for null values in the Dataset; none were found.
colSums(is.na(df_b))
day month year Temperature RH Ws Rain FFMC
0 0 0 0 0 0 0 0
DMC DC ISI BUI FWI Classes
0 0 0 0 0 0
colSums(is.na(df_s))
day month year Temperature RH Ws Rain FFMC
0 0 0 0 0 0 0 0
DMC DC ISI BUI FWI Classes
0 0 0 0 0 0
We then proceed to add a column to both datasets indicating the region (Wilaya) of each table. We chose the following encoding:
df_b[["Region"]] = 0
df_s[["Region"]] = 1
After that, we merge both datasets into a single dataframe using full_join(); this will allow us to easily explore and analyze the data. Beforehand, the DC and FWI columns of the Sidi-Bel Abbes table must be converted to numeric, which coerces malformed entries to NA.
df_s$DC <- as.double(df_s$DC)
Warning: NAs introduced by coercion
df_s$FWI <- as.double(df_s$FWI)
Warning: NAs introduced by coercion
df = full_join(df_s, df_b)
Joining, by = c("day", "month", "year", "Temperature", "RH", "Ws", "Rain", "FFMC", "DMC", "DC", "ISI", "BUI", "FWI", "Classes", "Region")
dim(df)
[1] 244 15
str(df)
'data.frame': 244 obs. of 15 variables:
$ day : int 1 2 3 4 5 6 7 8 9 10 ...
$ month : int 6 6 6 6 6 6 6 6 6 6 ...
$ year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
$ Temperature: int 32 30 29 30 32 35 35 28 27 30 ...
$ RH : int 71 73 80 64 60 54 44 51 59 41 ...
$ Ws : int 12 13 14 14 14 11 17 17 18 15 ...
$ Rain : num 0.7 4 2 0 0.2 0.1 0.2 1.3 0.1 0 ...
$ FFMC : num 57.1 55.7 48.7 79.4 77.1 83.7 85.6 71.4 78.1 89.4 ...
$ DMC : num 2.5 2.7 2.2 5.2 6 8.4 9.9 7.7 8.5 13.3 ...
$ DC : num 8.2 7.8 7.6 15.4 17.6 26.3 28.9 7.4 14.7 22.5 ...
$ ISI : num 0.6 0.6 0.3 2.2 1.8 3.1 5.4 1.5 2.4 8.4 ...
$ BUI : num 2.8 2.9 2.6 5.6 6.5 9.3 10.7 7.3 8.3 13.1 ...
$ FWI : num 0.2 0.2 0.1 1 0.9 3.1 6 0.8 1.9 10 ...
$ Classes : chr "not fire " "not fire " "not fire " "not fire " ...
$ Region : num 1 1 1 1 1 1 1 1 1 1 ...
summary(df)
day month year Temperature RH Ws
Min. : 1.00 Min. :6.0 Min. :2012 Min. :22.00 Min. :21.00 Min. : 6.0
1st Qu.: 8.00 1st Qu.:7.0 1st Qu.:2012 1st Qu.:30.00 1st Qu.:52.00 1st Qu.:14.0
Median :16.00 Median :7.5 Median :2012 Median :32.00 Median :63.00 Median :15.0
Mean :15.75 Mean :7.5 Mean :2012 Mean :32.17 Mean :61.94 Mean :15.5
3rd Qu.:23.00 3rd Qu.:8.0 3rd Qu.:2012 3rd Qu.:35.00 3rd Qu.:73.25 3rd Qu.:17.0
Max. :31.00 Max. :9.0 Max. :2012 Max. :42.00 Max. :90.00 Max. :29.0
Rain FFMC DMC DC ISI
Min. : 0.0000 Min. :28.60 Min. : 0.70 Min. : 6.90 Min. : 0.000
1st Qu.: 0.0000 1st Qu.:72.08 1st Qu.: 5.80 1st Qu.: 12.35 1st Qu.: 1.400
Median : 0.0000 Median :83.50 Median :11.30 Median : 33.10 Median : 3.500
Mean : 0.7607 Mean :77.89 Mean :14.67 Mean : 49.43 Mean : 4.774
3rd Qu.: 0.5000 3rd Qu.:88.30 3rd Qu.:20.75 3rd Qu.: 69.10 3rd Qu.: 7.300
Max. :16.8000 Max. :96.00 Max. :65.90 Max. :220.40 Max. :19.000
NA's :1
BUI FWI Classes Region
Min. : 1.10 Min. : 0.000 Length:244 Min. :0.0
1st Qu.: 6.00 1st Qu.: 0.700 Class :character 1st Qu.:0.0
Median :12.25 Median : 4.200 Mode :character Median :0.5
Mean :16.66 Mean : 7.035 Mean :0.5
3rd Qu.:22.52 3rd Qu.:11.450 3rd Qu.:1.0
Max. :68.00 Max. :31.100 Max. :1.0
NA's :1
unique(df$year)
[1] 2012
unique(df$month)
[1] 6 7 8 9
We check again for any NA values that might have been introduced while merging the data from both tables. We found one row containing NA values in DC and FWI; we delete that row, since losing a single observation will not affect our overall dataset.
colSums(is.na(df))
day month year Temperature RH Ws Rain FFMC
0 0 0 0 0 0 0 0
DMC DC ISI BUI FWI Classes Region
0 1 0 0 1 0 0
df = df %>% drop_na(DC)
dim(df)
[1] 243 15
We now display the distinct values of the categorical variables, mainly the Classes and Region columns.
unique(df$Classes)
[1] "not fire " "fire " "not fire " "not fire " "fire"
[6] "fire " "not fire" "not fire "
unique(df$Region)
[1] 1 0
We find that the Classes column contains values padded with unneeded space characters, so we trim those spaces.
df$Classes <- trimws(df$Classes, which = c("both"))
unique(df$Classes)
[1] "not fire" "fire"
df = df %>% drop_na(Classes)
# Encode the target: "not fire" -> 0, "fire" -> 1
df$Classes <- mapvalues(df$Classes, from=c("not fire","fire"), to=c(0,1))
unique(df$Classes)
[1] "0" "1"
df$Classes <- as.numeric(df$Classes)
st(df)
# Drop the constant year column (always 2012)
df <- df[-c(3)]
# Standardize the numeric predictors, leaving day, month, Classes and Region unscaled
df_scaled = df
df_scaled[-c(1,2,13,14)] <- scale(df[-c(1,2,13,14)])
st(df_scaled)
We have ended up with a clean and scaled dataframe named df_scaled, which we will use to visualize and further explore our data.
Our first instinct is to compare the two regions together in terms of number of fires, and average temperature.
aggregate(df$Classes ~ df$Region, FUN = sum)
aggregate(df$Temperature ~ df$Region, FUN = mean)
We use the unscaled dataset here in order to plot the real-life temperature values.
df %>%
group_by(Region) %>%
summarise(Region = Region, Number_of_fires = sum(Classes), Temperature = mean(Temperature)) %>%
ggplot(aes(x=Region, y=Number_of_fires, fill = Temperature))+
geom_col(position='dodge')
`summarise()` has grouped output by 'Region'. You can override using the `.groups` argument.
We can see that the Sidi-Bel Abbes region has both a greater total number of fires and a higher average temperature throughout the summer of 2012.
These results lead us to suspect a positive relationship between temperature and the likelihood of a fire. However, we need to investigate all the other variables as well, which is why we plot a correlation matrix of the features in the dataset.
corr_mat <- round(cor(df_scaled),2)
p_mat <- cor_pmat(df_scaled)
corr_mat <- ggcorrplot(
corr_mat,
hc.order = FALSE,
type = "upper",
outline.col = "white",
)
ggplotly(corr_mat)
We performed feature selection using the caret package to determine which features are the most and least important.
In this case, we opted for Linear Discriminant Analysis with stepwise feature selection by specifying stepLDA as our method.
The varImp function returns an importance score out of 100 for each feature. According to the official caret documentation, the importance metric is calculated by conducting a ROC curve analysis on each predictor: a series of cutoffs is applied to the predictor data to predict the class, and the resulting AUC is used as the measure of variable importance.
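The same per-predictor ROC analysis can be reproduced directly with caret's filterVarImp function. The following is an illustrative sketch (the predictor selection is our assumption; Classes must be a factor for a classification importance computation):

```r
library(caret)

# AUC of each predictor used alone as a score to separate the two classes;
# this is the per-predictor ROC analysis described in the caret docs.
predictors <- c("Temperature", "RH", "Ws", "Rain", "FFMC",
                "DMC", "DC", "ISI", "BUI", "FWI")
roc_imp <- filterVarImp(x = df_scaled[, predictors],
                        y = as.factor(df_scaled$Classes))
# Sort from most to least important
roc_imp[order(-roc_imp[[1]]), , drop = FALSE]
```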
# prepare training scheme
df_scaled$Classes = as.factor(df_scaled$Classes)
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
modelLDA <- train(Classes~., data=df_scaled, method="stepLDA", trControl=control)
`stepwise classification', using 10-fold cross-validated correctness rate of method lda'.
219 observations of 13 variables in 2 classes; direction: both
stop criterion: improvement less than 5%.
correctness rate: 0.89026; in: "ISI"; variables (1): ISI
correctness rate: 0.95909; in: "FFMC"; variables (2): ISI, FFMC
hr.elapsed min.elapsed sec.elapsed
0.00 0.00 1.63
[Output for the remaining 29 resampling runs omitted: every run selected ISI first and, in most runs, FFMC second, with cross-validated correctness rates of roughly 0.90 and 0.96 respectively.]
`stepwise classification', using 10-fold cross-validated correctness rate of method lda'.
243 observations of 13 variables in 2 classes; direction: both
stop criterion: improvement less than 5%.
correctness rate: 0.90483; in: "ISI"; variables (1): ISI
correctness rate: 0.967; in: "FFMC"; variables (2): ISI, FFMC
hr.elapsed min.elapsed sec.elapsed
0.00 0.00 1.67
modelQDA <- train(Classes~., data=df_scaled, method="stepQDA", trControl=control)
`stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
218 observations of 13 variables in 2 classes; direction: both
stop criterion: improvement less than 5%.
correctness rate: 0.9816; in: "ISI"; variables (1): ISI
hr.elapsed min.elapsed sec.elapsed
0.00 0.00 1.11
[Output for the remaining 29 resampling runs omitted: each run stopped after a single variable, FFMC in most runs and ISI in a few, with cross-validated correctness rates of roughly 0.97.]
`stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
243 observations of 13 variables in 2 classes; direction: both
stop criterion: improvement less than 5%.
correctness rate: 0.97517; in: "FFMC"; variables (1): FFMC
hr.elapsed min.elapsed sec.elapsed
0.00 0.00 1.25
importanceLDA <- varImp(modelLDA, scale=FALSE)
plot(importanceLDA)
We can see that the variables month, Ws, Region, and day are insignificant compared to the other features, so we will disregard them in our models.
We first perform Logistic Regression on our dataset. We begin by splitting the data into train/test sets with an 80/20 split, a commonly used default.
set.seed(6)
split <- sample.split(df_scaled, SplitRatio=0.8)
train_set <- subset(df_scaled, split == "TRUE")
test_set <- subset(df_scaled, split=="FALSE")
head(train_set)
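As a side note, caTools::sample.split is designed to take the label vector rather than the whole dataframe; a stratified variant of the split above (a sketch, not the split actually used for the reported results) would preserve the fire/non-fire proportions in both subsets:

```r
library(caTools)

set.seed(6)
# Splitting on the labels keeps the class ratio similar in train and test
split_strat <- sample.split(df_scaled$Classes, SplitRatio = 0.8)
train_strat <- subset(df_scaled, split_strat == TRUE)
test_strat  <- subset(df_scaled, split_strat == FALSE)
# Check the class proportions in the training subset
table(train_strat$Classes) / nrow(train_strat)
```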
We create our model with the features that were most important during the feature selection step, then fit it to the training data.
After that, we evaluate the model on the test set, using a threshold of 0.5 on the predicted probabilities to assign classes. This results in one false negative and no false positives, and the model reaches an accuracy of about 98% on the test data. Note the glm warnings below: fitted probabilities being numerically 0 or 1 indicates (quasi-)complete separation of the classes, so the individual coefficients and their standard errors should not be over-interpreted.
logistic_model <- glm(Classes ~ Temperature+Rain+FFMC+DMC+DC+ISI+BUI+FWI+RH, data=train_set, family="binomial")
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logistic_model)
Call:
glm(formula = Classes ~ Temperature + Rain + FFMC + DMC + DC +
ISI + BUI + FWI + RH, family = "binomial", data = train_set)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.133e-04 -2.100e-08 2.100e-08 2.100e-08 2.268e-04
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 215.37 151740.22 0.001 0.999
Temperature -45.81 18387.96 -0.002 0.998
Rain 47.30 49155.97 0.001 0.999
FFMC 90.66 274359.46 0.000 1.000
DMC -62.72 63321.30 -0.001 0.999
DC 71.07 40576.40 0.002 0.999
ISI 342.22 351820.91 0.001 0.999
BUI -34.52 120610.62 0.000 1.000
FWI 152.95 276136.12 0.001 1.000
RH -15.04 13773.46 -0.001 0.999
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2.6436e+02 on 190 degrees of freedom
Residual deviance: 1.4546e-07 on 181 degrees of freedom
AIC: 20
Number of Fisher Scoring iterations: 25
predict <- predict(logistic_model, test_set, type="response")
predict
2 6 7 16 20 21 30
2.220446e-16 9.996458e-01 1.000000e+00 2.220446e-16 2.220446e-16 1.000000e+00 2.220446e-16
34 35 44 48 49 58 62
1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
63 72 76 77 86 90 91
1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
100 104 105 114 118 119 128
1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 2.220446e-16 2.220446e-16 1.000000e+00
132 133 142 146 147 156 160
1.000000e+00 1.000000e+00 8.681744e-06 1.000000e+00 1.000000e+00 1.000000e+00 2.220446e-16
161 170 174 175 184 188 189
2.220446e-16 1.000000e+00 2.220446e-16 2.220446e-16 2.220446e-16 1.000000e+00 1.000000e+00
198 202 203 212 216 217 226
1.000000e+00 1.000000e+00 1.000000e+00 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16
230 231 240
1.000000e+00 1.000000e+00 1.000000e+00
predict <- ifelse(predict >0.5,1,0)
predict
2 6 7 16 20 21 30 34 35 44 48 49 58 62 63 72 76 77 86 90 91 100 104 105
0 1 1 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
114 118 119 128 132 133 142 146 147 156 160 161 170 174 175 184 188 189 198 202 203 212 216 217
1 0 0 1 1 1 0 1 1 1 0 0 1 0 0 0 1 1 1 1 1 0 0 0
226 230 231 240
0 1 1 1
table(test_set$Classes,predict)
predict
0 1
0 15 0
1 1 36
misclassifications <- mean(predict != test_set$Classes)
print(paste('Accuracy =',1-misclassifications))
[1] "Accuracy = 0.980769230769231"
ROCPred <- prediction(predict,test_set$Classes)
ROCPer <- performance(ROCPred, measure="tpr",x.measure="fpr")
auc <- performance(ROCPred, measure = "auc")
auc <- auc@y.values[[1]]
auc
[1] 0.9864865
plot(ROCPer, colorize = TRUE)
The ROC curve rises almost straight to the top-left corner and the AUC is about 0.99, confirming near-perfect discrimination between fire and non-fire days on the test set. (Since the curve is computed from the thresholded 0/1 predictions rather than the raw probabilities, it effectively reflects a single operating point.)
Since LDA assumes that each input variable has the same variance, we will use the standardized dataframe for the train/test split. Each variable in the standardized dataframe has a mean of 0 and a variance of 1.
The chosen train/test ratio is 0.8:0.2, leaving 52 observations as unseen data for testing.
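Before fitting, we can sanity-check the standardization (a quick sketch; the column indices mirror the scaling step performed earlier, excluding day, month, Classes and Region):

```r
# Scaled columns should have mean ~0 and standard deviation ~1
round(colMeans(df_scaled[-c(1, 2, 13, 14)]), 6)
round(apply(df_scaled[-c(1, 2, 13, 14)], 2, sd), 6)
```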
set.seed(6)
split <- sample.split(df_scaled, SplitRatio=0.8)
train_set <- subset(df_scaled, split == "TRUE")
test_set <- subset(df_scaled, split=="FALSE")
dim(train_set)
[1] 191 14
dim(test_set)
[1] 52 14
We will train our model on the significant features selected in the feature selection phase, again using the standardized data.
lda_model = lda (Classes ~ Temperature+Rain+FFMC+DMC+DC+ISI+BUI+FWI+RH, data=train_set)
lda_model
Call:
lda(Classes ~ Temperature + Rain + FFMC + DMC + DC + ISI + BUI +
FWI + RH, data = train_set)
Prior probabilities of groups:
0 1
0.4764398 0.5235602
Group means:
Temperature Rain FFMC DMC DC ISI BUI FWI
0 -0.5780859 0.4348542 -0.8150123 -0.6690568 -0.6056867 -0.8222968 -0.6779796 -0.8138050
1 0.4762177 -0.3204676 0.6659827 0.6215861 0.5279307 0.6175900 0.6185834 0.6580962
RH
0 0.4522544
1 -0.3615521
Coefficients of linear discriminants:
LD1
Temperature 0.1021470
Rain 0.1406894
FFMC 1.1880939
DMC -1.2053965
DC -0.2464361
ISI 0.5731491
BUI 1.2481795
FWI 0.6869747
RH 0.5112902
Interpretation of the coefficients: since the predictors are standardized, the magnitude of each LD1 coefficient indicates how strongly a variable contributes to separating the two classes; here BUI, DMC and FFMC carry the largest absolute weights.
After getting our predictions, we will use the confusionMatrix function from the caret library, which computes a set of performance metrics including F1-score, recall and precision. Other metrics computed include sensitivity, specificity, prevalence, etc. The official documentation for this function, with the formulas for all metrics, can be found at https://rdrr.io/cran/caret/man/confusionMatrix.html. We will only be interested in the F1-score, recall, precision, accuracy and balanced accuracy.
As we can see below, the number of false positives is 0 and the number of false negatives is 2. These results are very good, though the other way around would have been preferable: we do not want to miss any positives, meaning we want to predict all fires. Our model yielded an F1-score of 96% and an accuracy of 96% as well.
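The prediction and evaluation step described above can be sketched as follows (assuming the lda_model and test_set defined earlier; mode = "prec_recall" asks confusionMatrix to report precision, recall and F1 alongside accuracy):

```r
library(MASS)
library(caret)

# predict() on an lda fit returns $class (labels) and $posterior (probabilities)
preds <- predict(lda_model, test_set)

# Positive class "1" = fire; prec_recall mode adds precision/recall/F1
confusionMatrix(data = preds$class,
                reference = test_set$Classes,
                positive = "1",
                mode = "prec_recall")
```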
#### Plotting the ROC curve
pred <- prediction(preds$posterior[,2], test_set$Classes)
perf <- performance(pred,"tpr","fpr")
plot(perf,colorize=TRUE)
Interpretation: